Abstract

With  the  explosion  of  imaging  applications,  and  due  to  the  massive  amounts  of  imagery 
data,  data  compression  is  essential.  Lossless  compression,  also  called  entropy  coding,  is  of  special 
importance  because  it  is  used  not  only  for  compression  of  text  files  and  medical  images,  but  also 
as  an  inherent  part  of  lossy  compression.  Therefore,  fast  entropy  coding/decoding  algorithms 
are  desirable.  In  this  paper  we  will  develop  parallel  algorithms  for  several  widely  used  entropy 
coding  techniques,  namely,  arithmetic  coding,  run-length  encoding  (RLE),  and  Huffman  coding. 
Our parallel arithmetic coding algorithm takes O(log^2 N) time on an N-processor hypercube,
where N is the input size. For RLE, our parallel coding and decoding algorithms take O(log N)
time on an N-processor computer. Finally, in the case of Huffman coding, the parallel coding
algorithm takes O(log^2 N + n log n) time, where n is the alphabet size, n << N. As for decoding,
both  arithmetic  and  Huffman  are  hard  to  parallelize.  However,  special  provisions  could  be 
made  in  many  applications  to  make  arithmetic  and  Huffman  decoding  fairly  parallel. 

Keywords: Arithmetic Coding, Run-Length Encoding, Huffman Coding, Decoding, Parallel
Algorithms, Hypercube, Statistics Gathering.


1 Introduction 


With  the  explosion  of  imaging  applications,  and  due  to  the  massive  amounts  of  imagery  data, 
data  compression  is  essential  to  reduce  the  storage  and  transmission  requirements  of  images  and 
videos  [14].  Indeed,  due  to  the  critical  importance  of  compression,  there  are  several  international 
organizations  that  develop  compression  standards.  Among  the  most  notable  standards  are 
JPEG  [11]  for  still  images  and  MPEG  for  videos  [5]. 

Compression  can  be  lossless  or  lossy.  Lossless  compression,  also  called  entropy  coding, 
allows  for  perfect  reconstruction  of  the  data,  whereas  lossy  compression  does  not.  Even  in  lossy 
compression,  which  is  by  far  more  prevalent  in  image  and  video  compression,  entropy  coding  is 
needed  as  a last  stage  after  the  data  has  been  transformed  and  quantized  [14,  18].  Therefore, 
fast  entropy  coding  algorithms  are  of  prime  importance,  especially  in  online  or  even  real-time 
applications  such  as  video  teleconferencing. 

Parallel  algorithms  are  an  obvious  choice  for  fast  processing.  Therefore,  in  this  paper  we 
will  develop  parallel  algorithms  for  several  widely  used  entropy  coding  techniques,  namely, 
arithmetic coding [13], run-length encoding (RLE) [12, 16], and Huffman coding [4]. Our
parallel arithmetic coding algorithm takes O(log^2 N) time on an N-processor hypercube, where
N is the input size. The time is dominated by sorting, for otherwise it takes O(log N) time.
Unfortunately,  arithmetic  decoding  seems  to  be  hard  to  parallelize  because  it  is  a sequential 
process  of  essentially  logical  computations.  In  practice,  however,  files  are  broken  down  into 
many  substrings  before  being  arithmetic-coded,  for  precision  reasons  that  will  become  clear 
later  on.  Accordingly,  the  coded  streams  of  those  substrings  can  be  decoded  in  parallel. 

1 This research was performed in part at the National Institute of Standards and Technology.
For RLE, we design parallel algorithms for both encoding and decoding, each taking O(log N)
time. Finally, in the case of Huffman coding, the algorithm is easily data-parallel; the development
of the Huffman tree to determine the codeword of each symbol of the alphabet is the
only sequential part, but its time complexity is often insignificant because the alphabet size
is typically small, on the order of tens or at most hundreds of symbols. The statistics
gathering for computing symbol probabilities needed for the Huffman tree is parallelized to take
O(log^2 N) time. Like arithmetic decoding, Huffman decoding is highly sequential. However, in
certain  applications  where  the  data  is  inherently  broken  into  many  blocks  that  are  processed 
independently  as  in  JPEG/MPEG,  simple  provisions  can  be  made  to  have  the  bitstreams  easily 
separable  into  many  independent  substreams  that  can  be  decoded  independently  in  parallel. 

It must be noted that other lossless compression techniques are also in use, such as Lempel-Ziv
[19], bit-plane coding [15], and differential pulse-code modulation (DPCM) [10]. The first
two will not be considered here for two reasons. First, they are not usually used in the entropy
coding stage of lossy compression. Second, Lempel-Ziv coding seems to be inherently
sequential,  and  bit-plane  coding  involves  essentially  RLE  and  Huffman  coding,  both  of  which 
are  covered  independently  in  this  paper.  The  last  technique,  DPCM,  involves  the  computation 
of  multidimensional  recurrence  relations,  and  is  the  subject  of  another  paper  by  this  author. 

The  paper  is  organized  as  follows.  The  next  section  gives  a brief  description  of  the  various 
standard  parallel  operations  that  will  be  used  in  our  algorithms.  Section  3 develops  a parallel 
algorithm  for  arithmetic  coding.  Section  4 develops  parallel  encoding  and  decoding  algorithms 
for  RLE.  Section  5 addresses  the  parallelization  of  Huffman  coding  and  decoding.  Conclusions 
and  future  directions  are  given  in  section  6. 


2 Preliminaries 


The  parallel  algorithms  designed  in  this  paper  use  several  standard  parallel  operations.  The 
following  is  a list  of  those  operations  along  with  a brief  description. 


• Parsort(Y[0 : N-1]; Z[0 : N-1], π[0 : N-1]): It sorts in parallel the input array Y into
Z, and records the permutation π that orders Y to Z: Z[k] = Y[π[k]]. It uses Batcher's
bitonic sorting [2], which takes O(log^2 N) time on an N-processor hypercube. The choice
is justified because other practical parallel sorting algorithms are slower, and the O(log N)
time sorting algorithms [1] are not practical due to their high constant factor.

• C = Parmult(A_{0:N-1}): It multiplies the N elements of the array A, yielding the product
C. In this paper, the elements of A are 2x2 matrices. This operation takes
O(log N) time on O(N) processors connected as a hypercube [3].

• A[0 : N-1] = Parprefix(a_{0:N-1}): This is the well-known parallel prefix operation [3, 7, 9].
It computes from the input array a the array A where A[i] = a[0] + a[1] + ... + a[i], for all
i = 0, 1, ..., N-1. Parallel prefix takes O(log N) time on an N-processor hypercube.

• T[0 : N-1] = Barrier-Parprefix(a[0 : N-1]): This operation assumes that the input
array a is divided into groups of consecutive elements; every group has a left-barrier at its
start and a right-barrier at its end. Barrier-Parprefix performs a parallel prefix within
each group independently of the other groups. It takes O(log N) time on an N-processor
hypercube. To see this, let f[0 : N-1] be a flag array where f[k] = 0 if k is a right-barrier,
and f[k] = 1 otherwise. Clearly, T[i] = f[i-1] T[i-1] + a[i] for all i > 0, with T[0] = a[0]. The latter is a
linear recurrence relation solvable in O(log N) time on an N-processor hypercube [6].
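
The input/output behaviour of the two prefix operations can be illustrated with a short sequential Python sketch. It does not model the O(log N) hypercube schedule; the function names `parprefix` and `barrier_parprefix` and the Boolean `right_barrier` flag array are assumptions made for illustration only.

```python
from itertools import accumulate


def parprefix(a):
    """Inclusive prefix sums: A[i] = a[0] + ... + a[i]."""
    return list(accumulate(a))


def barrier_parprefix(a, right_barrier):
    """Prefix sums that restart after every right-barrier.

    right_barrier[k] is True when position k ends a group. With
    f[k] = 0 at right-barriers and 1 elsewhere, the loop evaluates
    the linear recurrence T[i] = f[i-1] * T[i-1] + a[i].
    """
    out = []
    carry = 0  # holds f[i-1] * T[i-1]
    for value, is_end in zip(a, right_barrier):
        total = carry + value
        out.append(total)
        carry = 0 if is_end else total
    return out
```

With an all-ones input, the value at each right-barrier is the size of its group, which is how the operation is used for frequency counting later in the paper.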
3 Parallel Arithmetic Coding


Arithmetic  coding  [13]  relies  heavily  on  the  probability  distributions  of  the  input  files  to  be 
coded. Essentially, arithmetic coding maps each input file to a subinterval [L R] of the unit
interval [0 1] such that the probability of the input file is R - L. Afterwards, it represents the
fraction value L in n-ary using r = ⌈-log_n(R - L)⌉ n-ary digits, where n is the size of the
alphabet. The stream of those r digits is taken to be the code of the input file.

The mapping of an input file into a subinterval [L R] is done progressively by reading the
file and updating the [L R] subinterval to always be the corresponding subinterval of the input
substring scanned so far. The update rule works as follows. Assume that the input file is the
string x[0 : N-1] where every symbol is in the alphabet {a_0, a_1, ..., a_{n-1}}, and that the substring
x[0 : k-1] has been processed, i.e., mapped to interval [L R]. Let P_{ki} be the probability that
the next symbol is a_i given that the previous symbols are x[0 : k-1]. Divide the current
interval [L R] into n successive subintervals where the i-th subinterval is of length P_{ki}(R - L) for
i = 0, 1, ..., n-1, that is, the i-th subinterval is [L_i R_i] where L_i = L + (P_{k0} + P_{k1} + ... + P_{k,i-1})(R - L)
and R_i = L_i + P_{ki}(R - L). Finally, if the next symbol in the input file is a_{i0} for some i0, the
update is L = L_{i0} and R = R_{i0}. The last value of the interval [L R] after the whole input string
has been processed is the desired subinterval.

The alphabet {a_0, a_1, ..., a_{n-1}} can be arbitrary. Some of the typical alphabets are the binary
alphabet {0, 1} for binary input files, the ASCII alphabet, and any finite set of real numbers or
integers as may occur in run-length encoding. In the last category, the alphabet {a_0, a_1, ..., a_{n-1}}
can be easily mapped to the more convenient alphabet {0, 1, ..., n-1}. That mapping is applied
at the outset before arithmetic coding starts, and the inverse mapping is applied after arithmetic
decoding is completed. Henceforth, we will assume the alphabet to be {0, 1, ..., n-1}.

The conditional probabilities {P_{ki}} are either computed statistically from the input file or
derived from an assumed theoretical probabilistic model about the input files. Naturally, the
statistical method is the one used most often, and will be assumed here. The structure of the
probabilistic model is, however, still useful in knowing what statistical data should be gathered.
The model often used is the Markov model of a certain order m, where m tends to be fairly
small, on the order of 1 to 5. That is, the probability that the next symbol is of some value a
depends on only the values of the previous m symbols. Therefore, to determine statistically the
probability that the next symbol is a given that the previous m symbols are some b_1 b_2 ... b_m, it
suffices to compute the frequency of occurrences of the substring b_1 b_2 ... b_m a in the input string,
and normalize that frequency by N, which is the total number of substrings of length m+1
symbols in the zero-padded input string. The padding of m 0's to the left of x is taken to
simplify the statistics gathering at the left boundary of x: assume that the imaginary symbols
x[-m : -1] are all 0.

To summarize, the sequential algorithm for computing the statistical probabilities and performing
arithmetic coding is presented next.


Algorithm Arithmetic-coding(input: x[0 : N-1]; output: B)
begin /* The alphabet is assumed to be known, say {0, 1, ..., n-1} */

Phase I: Statistics Gathering

for k = 0 to N-1 do /* compute the probabilities P_{ki}'s, which are initialized to 0 */
    compute the frequency f_k of the substring x[k-m : k] in the whole string x;
    set Q_k = f_k/N;
    let i = x[k], and set P_{ki} = Q_k;
endfor

for k = 0 to N-1 do /* compute the probabilities P_k's, which are initialized to 0 */
    let i = x[k], and set P_k = P_{k0} + P_{k1} + ... + P_{k,i-1};
endfor

Phase II: finding the interval [L R] corresponding to the string x

Initialize: L = 0 and R = 1;
for k = 0 to N-1 do
    D = R - L;
    L = L + P_k D;
    R = L + Q_k D;
endfor

Phase III: computing the output stream B as a code for the input string x

r = ⌈-log_n(R - L)⌉;
Take the n-ary representation of L = 0.L_1 L_2 ... L_r ...;
B = [L_1 L_2 ... L_r];

end
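
The three phases can be traced with a small sequential Python sketch. It is an illustration under stated assumptions, not the authors' implementation: symbols are assumed to be integers in {0, ..., n-1}, exact rationals (`fractions.Fraction`) stand in for finite-precision arithmetic, P_k is read as the sum of the normalized frequencies of all tuples that share the m-symbol context of position k and end in a symbol smaller than x[k], and the function name `arithmetic_code` is hypothetical. The sketch follows the algorithm literally and makes no claim about the decodability of the truncated digit string.

```python
from collections import Counter
from fractions import Fraction


def arithmetic_code(x, n, m=1):
    """Return (L, R, digits) for the integer string x over {0, ..., n-1}."""
    N = len(x)
    padded = [0] * m + list(x)  # imaginary symbols x[-m:-1] are 0
    tuples = [tuple(padded[k:k + m + 1]) for k in range(N)]
    freq = Counter(tuples)

    # Phase I: Q_k and the cumulative probabilities P_k.
    Q = [Fraction(freq[t], N) for t in tuples]
    P = [
        sum((Fraction(freq[t[:-1] + (j,)], N) for j in range(t[-1])), Fraction(0))
        for t in tuples
    ]

    # Phase II: progressive narrowing of [L R].
    L, R = Fraction(0), Fraction(1)
    for k in range(N):
        D = R - L
        L = L + P[k] * D
        R = L + Q[k] * D

    # Phase III: r = ceil(-log_n(R - L)), computed exactly.
    r = 0
    while Fraction(1, n ** r) > R - L:
        r += 1
    digits, frac = [], L
    for _ in range(r):
        frac *= n
        d = int(frac)  # integer part is the next n-ary digit
        digits.append(d)
        frac -= d
    return L, R, digits
```

For the binary string 0101 with m = 1, the sketch narrows the interval to [9/128, 11/128] and emits the six binary digits 000100.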

To  parallelize  the  Arithmetic-coding  algorithm,  the  first  two  phases  have  to  be  parallelized. 
Note  that  in  Phase  III,  L is  naturally  represented  in  binary  inside  the  computer,  so  phase  III 
is nothing more than chopping off the first ⌈r log n⌉ bits of the binary representation of L.

Parallelization  of  Phase  I:  Statistics  Gathering 

Each substring x[k-m : k] is treated as an (m+1)-tuple of integer components, for k =
0, 1, ..., N-1; those N tuples are denoted as an array Y[0 : N-1]. Denote the m+1 components
of Y[k] as (Y_m[k], ..., Y_1[k], Y_0[k]), that is, Y_0[k] = x[k], Y_1[k] = x[k-1], ..., Y_m[k] = x[k-m].
Sort Y into Z using Parsort(Y; Z, π), where Z[k] = Y[π[k]]. Clearly, all identical tuples are
consecutive in Z. Associate Q_{π[k]} and P_{π[k],Z_0[k]} with tuple Z[k] = Y[π[k]].

We  will  divide  Z into  segments  and  supersegments.  A segment  is  any  maximal  subarray  of 
identical  consecutive  elements  of  Z.  A supersegment  is  a maximal  set  of  consecutive  segments 
where the tuple values differ in at most the rightmost component. The probabilities Q_k's and
P_{k,x[k]}'s are then computed as follows:

Procedure Compute-probs(input: Z[0 : N-1], π[0 : N-1]; output: Q_k's, P_{k,x[k]}'s)

begin

1. Put a left-barrier and a right-barrier at the beginning and at the end of every segment,
respectively. It can be done in the following way. First, put a left barrier at k = 0 and a
right barrier at k = N-1. Afterwards, do

for k = 0 to N-2 pardo
    if Z[k] < Z[k+1], put a right barrier at k and a left barrier at k+1.
endfor

2. Let g[0 : N-1] be an integer array where every term is initialized to 1;

3. G[0 : N-1] = Barrier-Parprefix(g).

Clearly, if k corresponds to a right barrier of a segment, then G[k] is the number of terms
of that segment, that is, G[k] is the frequency of Z[k] = Y[π[k]].

4. Broadcast within each segment the G[k] of the segment's right barrier, and then set in
parallel every G[i] term in the segment to G[k].

5. for k = 0 to N-1 pardo
    set Q_{π[k]} = G[k]/N and P_{π[k],Y_0[π[k]]} = G[k]/N.
endfor

end
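
The five steps of Compute-probs can be simulated sequentially as follows. This is a hedged sketch: Python's built-in `sorted` replaces Parsort, ordinary loops replace the `pardo` steps and the segmented prefix, and the function name `compute_probs` and its return convention (Z, π, and Q indexed by original position) are assumptions for illustration.

```python
from fractions import Fraction


def compute_probs(x, m=1):
    """Return (Z, pi, Q) where Q[k] is the normalized frequency of x[k-m:k]."""
    N = len(x)
    padded = [0] * m + list(x)
    Y = [tuple(padded[k:k + m + 1]) for k in range(N)]
    pi = sorted(range(N), key=lambda k: Y[k])  # stand-in for Parsort
    Z = [Y[p] for p in pi]

    # Step 1: right-barriers mark the last element of each segment.
    right = [k == N - 1 or Z[k] < Z[k + 1] for k in range(N)]

    # Steps 2-3: segmented prefix sum of an all-ones array.
    G, count = [], 0
    for k in range(N):
        count += 1
        G.append(count)
        if right[k]:
            count = 0

    # Step 4: broadcast each segment's total from its right barrier.
    for k in range(N - 2, -1, -1):
        if not right[k]:
            G[k] = G[k + 1]

    # Step 5: store the probability at the original position pi[k].
    Q = [None] * N
    for k in range(N):
        Q[pi[k]] = Fraction(G[k], N)
    return Z, pi, Q
```

In the sketch the result is written directly to position π[k], which folds the later permutation routing into step 5.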

Observe that the Q_{π[k]}'s within any one single segment, and therefore the P_{π[k],Y_0[π[k]]}'s within
any segment, are all equal. We call Q_{π[k]} (or P_{π[k],Y_0[π[k]]}) the probability of that segment.
Observe also that the cumulative probability P_k, which is defined as P_{k0} + P_{k1} + ... + P_{k,x[k]-1}, is
the sum of probabilities of all tuples where the m leftmost symbols are equal to x[k-m : k-1]
and where the rightmost symbol is ≤ x[k] - 1. Stated otherwise, P_{π[k]} is the sum of the
probabilities of all the segments within the supersegment containing k such that the m leftmost
symbols are equal to those of Y[π[k]] and the rightmost symbol is ≤ x[π[k]] - 1. The following
procedure will compute those cumulative probabilities P_k's.

Procedure Compute-cumprobs(input: Z, π, Q_k's, P_{k,x[k]}'s; output: P_k's)
begin

1. Put a left-barrier and a right-barrier at the beginning and at the end of every supersegment,
respectively, as follows. For each k, denote by Z'[k] the m-tuple consisting of the
m leftmost components of Z[k], that is, Z'[k] is all but the rightmost component of Z[k].
Put a left barrier at k = 0 and a right barrier at k = N-1. Afterwards, do

for k = 0 to N-2 pardo
    if Z'[k] < Z'[k+1], put a right barrier at k and a left barrier at k+1.
endfor

2. Let h[0 : N-1] be a real array where every term is initialized to 0; for each k =
0, 1, 2, ..., N-1, h[k] is associated with Z[k].

for k = 0 to N-1 pardo
    if k happens to be the start of a segment (rather than a supersegment), then
    set h[k] = P_{π[k],Z_0[k]}.
endfor

3. H[0 : N-1] = Barrier-Parprefix(h), using the supersegment barriers.

It can be easily shown that H[k] is the sum of the probabilities of all the segments
within the supersegment containing k such that the m leftmost symbols are equal to
those of Y[π[k]] and the rightmost symbol is ≤ x[π[k]]. After the discussion above,
H[k] = P_{π[k]} + Q_{π[k]}. This justifies the next step.

4. for k = 0 to N-1 pardo
    P_{π[k]} = H[k] - Q_{π[k]}.
endfor

end
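
A sequential sketch of Compute-cumprobs is given below, assuming the sorted tuple array Z and an array `q` in which `q[k]` holds the probability of the segment containing Z[k] (that is, Q_{π[k]}). The function name `compute_cumprobs` and the choice to return the result in sorted order (entry k is P_{π[k]}) are illustrative assumptions; a plain loop stands in for Barrier-Parprefix.

```python
from fractions import Fraction


def compute_cumprobs(Z, q):
    """Return P aligned with Z: P[k] is the cumulative probability P_{pi[k]}."""
    N = len(Z)
    # Segment starts: the tuple changes. Supersegment starts: the m-symbol context changes.
    seg_start = [k == 0 or Z[k - 1] != Z[k] for k in range(N)]
    super_start = [k == 0 or Z[k - 1][:-1] != Z[k][:-1] for k in range(N)]

    # Step 2: each segment contributes its probability once, at its start.
    h = [q[k] if seg_start[k] else Fraction(0) for k in range(N)]

    # Step 3: prefix sums restarted at every supersegment.
    H, running = [], Fraction(0)
    for k in range(N):
        if super_start[k]:
            running = Fraction(0)
        running += h[k]
        H.append(running)

    # Step 4: remove the segment's own probability.
    return [H[k] - q[k] for k in range(N)]
```

For Z = [(0,0), (0,1), (0,1), (1,0)] with segment probabilities 1/4, 1/2, 1/2, 1/4, the sketch yields 0, 1/4, 1/4, 0.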

Time  Analysis  of  Phase  I 

For each k, assume that x[k], Y[k] and Z[k] will be hosted by processor k. The gathering of
x[k-m : k] to processor k to form Y[k] requires m shifts that send data from node i to node
i+1 for all i. Each shift takes O(log N) communication time on an N-processor hypercube.
Therefore, the forming of Y takes O(m log N) = O(log N) communication time for the m shifts.
The reason m was dropped from the time formula is that m is fairly small, on the order of
1 to 5 usually, and thus is assumed to be a constant.

Parsort takes O(log^2 N) time on an N-processor hypercube. Because of the special suitability
of hypercubes for bitonic sorting, the architecture for arithmetic coding will be assumed
to be an N-processor hypercube.

Procedure Compute-probs will be shown to take O(log N) time. Step 1 involves an exchange
of the values Z[k] and Z[k+1] between processors k and k+1, for all k. This is accomplished
by two shifts: one from k to k+1 and the other from k+1 to k, for all k. Thus, this step takes
O(log N) time. Step 2 takes O(1) time. Step 3, Barrier-Parprefix, takes O(log N) time. Step
4, being several independent broadcasts within nonoverlapping portions of the hypercube, also
takes O(log N) time. Finally, step 5 takes O(1) time because it is a simple parallel step. This
establishes that the whole procedure takes O(log N) parallel time.

The  analysis  of  the  procedure  Compute-cumprobs  is  very  similar,  and  shows  that  it  takes 
O(log N) parallel time as well.

It must be noted that after executing the last two procedures, the probabilities P_k and
Q_k are to be sent to processor k, for each k = 0, 1, ..., N-1. At present, P_{π[k]} and Q_{π[k]}
are in processor k along with Z[k]. Therefore, for all k, processor k sends P_{π[k]} and Q_{π[k]} to
processor π[k]. That is, this communication step is just a permutation routing of π. If routed
using Valiant's randomized routing algorithm [17], it will take O(log N) communication time
with overwhelming probability. Otherwise, π can be routed by bitonic sorting of its destinations,
taking O(log^2 N) time.

In conclusion, the statistics gathering process takes O(log^2 N) parallel time for both communication
and computation. It remains to parallelize Phase II of arithmetic coding.

Parallelization of Phase II: the Computation of [L R]

It will be shown that the computation of the interval [L R] is the product of N 2x2
matrices formed from the probabilities P_k and Q_k. Afterwards, we can use the parallel operation
Parmult to multiply the N matrices in O(log N) time on N processors.

Let the updated values of L and R at iteration k of the for-loop of Phase II of the algorithm
'Arithmetic-coding' be denoted L_k and R_k, respectively. Clearly,

\[
L_k = L_{k-1} + P_k (R_{k-1} - L_{k-1}) = (1 - P_k) L_{k-1} + P_k R_{k-1},
\]
\[
R_k = L_k + Q_k (R_{k-1} - L_{k-1}) = (1 - P_k) L_{k-1} + P_k R_{k-1} + Q_k (R_{k-1} - L_{k-1}) = (1 - P_k - Q_k) L_{k-1} + (P_k + Q_k) R_{k-1}.
\]

In summary, we have

\[
L_k = (1 - P_k) L_{k-1} + P_k R_{k-1} \quad\text{and}\quad R_k = (1 - P_k - Q_k) L_{k-1} + (P_k + Q_k) R_{k-1}. \tag{1}
\]

Letting

\[
X_k = \begin{bmatrix} L_k \\ R_k \end{bmatrix} \quad\text{and}\quad A_k = \begin{bmatrix} 1 - P_k & P_k \\ 1 - P_k - Q_k & P_k + Q_k \end{bmatrix}, \tag{2}
\]

equation 1 becomes a simple vector recurrence relation of order 1:

\[
X_k = A_k X_{k-1}. \tag{3}
\]

The last equation implies that the last subinterval [L R] = [L_{N-1} R_{N-1}] that is being sought,
which corresponds to X_{N-1}, is X_{N-1} = A_{N-1} A_{N-2} ... A_0 X_{-1}, or equivalently,

\[
\begin{bmatrix} L_{N-1} \\ R_{N-1} \end{bmatrix} = A_{N-1} A_{N-2} \cdots A_0 \begin{bmatrix} L_{-1} \\ R_{-1} \end{bmatrix}. \tag{4}
\]
Since X_{-1} = [L_{-1} R_{-1}]^T = [0 1]^T, it follows that [L_{N-1} R_{N-1}]^T is the right column of the product
matrix A_{N-1} A_{N-2} ... A_0. That product is clearly computable with the parallel operation
Parmult(A_{N-1:0}), taking O(log N) time on N processors, as indicated in section 2. The whole
parallel algorithm for arithmetic coding can now be put together as follows.


Algorithm Parallel-arithmetic-coding(input: x[0 : N-1]; output: B)
begin

Form the array Y[0 : N-1] of tuples;
Parsort(Y; Z, π);
Compute-probs(Z, π; Q_k's, P_{k,x[k]}'s);
Compute-cumprobs(Z, π, Q_k's, P_{k,x[k]}'s; P_k's);
Route permutation π to send P_{π[k]} and Q_{π[k]} from processor k to processor π[k], for all k;

for k = 0 to N-1 pardo
    A_k = [ 1 - P_k          P_k
            1 - P_k - Q_k    P_k + Q_k ];
endfor

C = Parmult(A_{N-1:0});
L = C(1, 2); R = C(2, 2);
r = ⌈-log_n(R - L)⌉;
Take the n-ary representation of L = 0.L_1 L_2 ... L_r ...;
B = [L_1 L_2 ... L_r];

end
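
The equivalence between the sequential Phase II loop and the matrix product can be checked with the following Python sketch. It is an illustration only: the pairwise reduction in `tree_product` mimics the order in which a Parmult-style operation would combine matrices but runs sequentially, exact rationals are assumed, and all function names are hypothetical.

```python
from fractions import Fraction


def matmul2(A, B):
    """Product of two 2x2 matrices stored as ((a, b), (c, d))."""
    return (
        (A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]),
        (A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]),
    )


def tree_product(mats):
    """Order-preserving pairwise reduction of a non-empty list of matrices."""
    while len(mats) > 1:
        nxt = [matmul2(mats[i], mats[i + 1]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:
            nxt.append(mats[-1])  # odd element is carried to the next round
        mats = nxt
    return mats[0]


def interval_by_matrices(P, Q):
    """[L R] as the right column of A_{N-1} ... A_0."""
    A = [((1 - p, p), (1 - p - q, p + q)) for p, q in zip(P, Q)]
    C = tree_product(A[::-1])
    return C[0][1], C[1][1]


def interval_by_loop(P, Q):
    """[L R] by the sequential Phase II update."""
    L, R = Fraction(0), Fraction(1)
    for p, q in zip(P, Q):
        D = R - L
        L = L + p * D
        R = L + q * D
    return L, R
```

Because matrix multiplication is associative, any grouping of the product gives the same interval, which is what allows the logarithmic-depth reduction.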

Time of the whole algorithm: Based on the preceding time analyses, the overall parallel
time of the algorithm is O(log^2 N) on an N-processor hypercube. Indeed, the parallel sorting
is what dominates the time, for otherwise, the algorithm takes O(log N) time.

Arithmetic  Decoding 

Arithmetic decoding, which reconstructs the string x from the stream B and the probabilities,
is much harder to parallelize. It works as follows. The interval [L R] is narrowed down
progressively as in coding, where the initial value is [0 1]. The final interval, call it [L_f R_f], is
known at decoding time from the stream B: D = n^{-r} where r is the length of the stream B,
L_f = (stream B as an n-ary number)/n^r, and R_f = L_f + D. To figure out the next symbol in
the file, using the next n-ary digit B[i], the current interval [L R] is divided into n subintervals
as in coding, one subinterval per alphabet symbol; afterwards, decode B[i] as alphabet symbol
a_j if [L_f R_f] is contained within the j-th subinterval. Thus, the recurrence relation for the
decoded symbols involves essentially positional rather than numerical computations, making it
hard to parallelize its computation.

In practice, however, arithmetic coding is applied in a way that allows for some decoding
parallelism. Because of accuracy problems, if the input string size N is fairly large, the
intermediary  intervals  [L  R]  become  too  small  for  the  precision  afforded  by  most  computers. 
Therefore,  long  input  files  are  broken  into  several  blocks  of  lengths  that  do  not  lead  to  serious 
underflow  problems;  those  blocks  are  arithmetic-coded  independently,  except  perhaps  in  the 
statistics  gathering,  which  involves  the  whole  file  to  reduce  the  probability  model  information 
overhead  to  be  included  in  the  header  of  the  stream  B.  Accordingly,  the  streams  of  those  blocks 
can  be  decoded  independently  in  parallel.  The  actual  details  are  not  included  here,  and  will 
vary  from  application  to  application,  although  the  principle  is  the  same. 
4 Parallel Run-Length Encoding


Run-length encoding (RLE) [12, 16] applies with good performance when the input string
x[0 : N-1] consists of a relatively short sequence of runs (say r runs, r << N), where a run
is a substring of consecutive symbols of equal value. RLE converts x into a sequence of pairs
(L_0, V_0), (L_1, V_1), ..., (L_{r-1}, V_{r-1}), where L_i is the length of the i-th run, and V_i is the value of
the recurring symbol of that run.

Often, there is considerable redundancy in the values of the L_i's and the V_i's, and certain L_i
(or V_i) values occur more frequently than others. In that case, Huffman coding [4] is applied to
code the L_i's and/or the V_i's. Parallelizing Huffman coding is the subject of the next section.
However,  in  the  parallel  RLE  algorithm,  we  will  put  the  data  in  the  right  form  and  locations. 

The  parallel  RLE  algorithm  coincides  with  the  first  3 steps  of  the  algorithm  ‘Compute-probs’ 
that  was  developed  earlier  for  arithmetic  coding.  The  segments  there  correspond  to  runs  in 
RLE.  After  those  steps  execute,  each  right  barrier  of  a segment  has  the  L and  V of  its  run. 
Afterwards,  the  scattered  (L,  V)  pairs  should  be  gathered  to  the  first  r processors  in  the  system, 
in case further processing is needed, as for example Huffman coding the L_i's and the V_i's. The
parallel  algorithm  for  RLE  can  now  be  given  as  follows. 


Algorithm Parallel-RLE(input: x[0 : N-1]; output: L, V)
begin

1. Put a left-barrier at k = 0 of x, and a right-barrier at k = N-1;

2. for k = 0 to N-2 pardo /* put barriers around the runs of x */
3.     if x[k] ≠ x[k+1], then put a right-barrier at k and a left-barrier at k+1;
   endfor

4. Let g[0 : N-1] be an integer array where every term is initialized to 1;

5. G[0 : N-1] = Barrier-Parprefix(g);
   /* if k is a right-barrier, G[k] is the length of the corresponding run */

6. Let h[0 : N-1] be an integer array initialized to 0;

7. for k = 0 to N-1 pardo
8.     if k is a right-barrier, set h[k] = 1;
   endfor

9. H[0 : N-1] = Parprefix(h); /* an ordinary prefix sum that ignores the barriers */
   /* when k is a right-barrier and i = H[k] - 1, the corresponding run is the i-th run of x */

10. for k = 0 to N-1 pardo
11.     if k is a right-barrier then
12.         i = H[k] - 1; L_i = G[k]; V_i = x[k];
13.         Processor k sends (L_i, V_i) to processor i;
        endif
    endfor

end
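
The data flow of Parallel-RLE can be followed in this sequential Python sketch. It is a simulation under assumptions: loops replace the `pardo` steps, the final list assignment stands in for the partial-permutation routing, and the run index is obtained with an ordinary prefix sum over the right-barrier flags (a prefix restarted at every barrier would always yield 1). The function name `parallel_rle` is hypothetical.

```python
from itertools import accumulate


def parallel_rle(x):
    """Return the list of (length, value) pairs of the runs of x."""
    N = len(x)
    if N == 0:
        return []

    # Steps 1-3: position k is a right-barrier when a run ends there.
    right = [k == N - 1 or x[k] != x[k + 1] for k in range(N)]

    # Steps 4-5: segmented prefix sum of ones gives run lengths at right-barriers.
    G, run = [], 0
    for k in range(N):
        run += 1
        G.append(run)
        if right[k]:
            run = 0

    # Steps 6-9: global prefix sum of the flags numbers the runs 1, 2, ..., r.
    H = list(accumulate(1 if flag else 0 for flag in right))

    # Steps 10-13: each right-barrier "sends" its pair to slot H[k] - 1.
    pairs = [None] * H[-1]
    for k in range(N):
        if right[k]:
            pairs[H[k] - 1] = (G[k], x[k])
    return pairs
```

For the input string aaabccdd the sketch produces the pairs (3, a), (1, b), (2, c), (2, d).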
Time Analysis of Parallel-RLE

The system is assumed to be an N-processor hypercube.

Steps 1-3 and 7-8 for barrier setting, as well as steps 4 and 6, take O(1) parallel time.
Steps 5 and 9, being prefix operations, take O(log N) time. The computation in steps 10-12
takes O(1) time, while the communication in step 13, which is a partial-permutation
routing, takes O(log N) time using Valiant's randomized routing algorithm. Therefore, the
whole algorithm takes O(log N) time.


Parallel  Run-length  Decoding 

To perform run-length decoding (RLD), we start from the (L_i, V_i)'s as input, where (L_i, V_i)
is in processor i, for i = 0, 1, ..., r-1. RLD reconstructs the original string x, where the first
L_0 symbols are all V_0, the next L_1 symbols are all V_1, and so on. The algorithm determines the
start and end locations of each run. Run 0 starts at location 0 and ends at location L_0 - 1,
run 1 starts at location L_0 and ends at location L_0 + L_1 - 1, and generally, run i starts at
location S[i-1] = L_0 + L_1 + ... + L_{i-1} and ends at location L_0 + L_1 + ... + L_{i-1} + L_i - 1.
All those prefix sums of L are computed with Parprefix(L) in O(log r) time on r processors.
Afterwards, for i = 1, 2, ..., r-1, processor i must send (L_i, V_i) to processor L_0 + L_1 + ... + L_{i-1};
the sending of those r-1 messages is a partial-permutation routing that takes O(log N) time
on the hypercube. Finally, those recipients of the (L_i, V_i)'s, including processor 0 which has
(L_0, V_0), broadcast their value V_i to the next L_i - 1 processors, completing the decoding. Those
r broadcasts run in nonoverlapping parts of the hypercube, taking O(max_i {log L_i}) = O(log N)
time, and thus the whole algorithm, summarized below, takes O(log N) time.


Algorithm Parallel-RLD(input: L_{0:r-1}, V_{0:r-1}; output: x)
begin

S[0 : r-1] = Parprefix(L_{0:r-1}); /* S[i], L_i, and V_i are in processor i; S[-1] is taken to be 0 */

for i = 1 to r-1 pardo
    Processor i sends (L_i, V_i) to processor S[i-1];
endfor

for i = 0 to r-1 pardo
    Processor s = S[i-1] broadcasts V_i to processors s+1, s+2, ..., s + L_i - 1;
    for j = S[i-1] to S[i-1] + L_i - 1 pardo
        Processor j sets x[j] = V_i;
    endfor
endfor

end
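
A sequential Python sketch of Parallel-RLD follows. It only illustrates how the prefix sums determine where each run is written; the routing and broadcast steps are collapsed into direct assignments, S[-1] is taken to be 0, and the function name `parallel_rld` is an assumption.

```python
from itertools import accumulate


def parallel_rld(lengths, values):
    """Rebuild x from run lengths L_i and run values V_i."""
    S = list(accumulate(lengths))  # S[i] = L_0 + ... + L_i
    x = [None] * (S[-1] if S else 0)
    for i, (length, value) in enumerate(zip(lengths, values)):
        start = S[i - 1] if i > 0 else 0  # run i starts at S[i-1]
        for j in range(start, start + length):
            x[j] = value  # "processor j sets x[j] = V_i"
    return x
```

Applied to the lengths 3, 1, 2, 2 and values a, b, c, d, it returns the symbols of aaabccdd.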


5 Parallel  Huffman  Coding 

In Huffman coding the individual symbols of the alphabet are coded in binary using the frequencies
(or probabilities) of occurrences of the symbols, such that no symbol code is the prefix
of another symbol code. Afterwards, the input file (or string x[0 : N-1]) is coded by replacing
each symbol x[i] by its code.

The Huffman algorithm for coding the alphabet is a greedy algorithm and works as follows.
Suppose that the alphabet is {a_0, a_1, ..., a_{n-1}}, and let p_i be the probability of
occurrence of symbol a_i, for i = 0, 1, ..., n - 1. The algorithm builds a Huffman binary tree.
First, a node is created for each alphabet symbol; afterwards, the algorithm repeatedly selects
the two unparented nodes of smallest probability, creates a new parent node for them, and sets
the probability of the new node to the sum of the probabilities of its two children. Once
the root is created, the edges of the tree are labeled, left edges with 0 and right edges with 1.
Finally, each symbol is coded with the binary sequence that labels the path from the root to
the leaf node of that symbol. If a min-heap is built on the original leaves (keyed by
probability), the repeated insertions and deletions on the heap take O(n log n) time. The
labeling of the tree and the extraction of the leaf codes take O(n log n) time as well. Therefore,
the whole algorithm for alphabet coding takes O(n log n) time. Because the alphabet tends to be
very small, and its size is independent of the much larger size of the input files to be coded,
the O(n log n) term is relatively small and can even be treated as a constant when
measuring the time of coding the whole input file.
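
The greedy construction described above can be sketched in Python with the standard heapq module. This is a minimal illustration, not the paper's implementation: the function name huffman_codes, the tuple-based node representation, and the tie-breaking counter are all assumptions made for the example.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Return {symbol: bitstring} for a dict of symbol frequencies (or probabilities)."""
    if not freq:
        return {}
    tiebreak = count()  # unique counter so heap entries never compare the nodes themselves
    # Heap entries are (weight, tiebreak, node); a node is ("leaf", sym) or ("node", left, right).
    heap = [(w, next(tiebreak), ("leaf", sym)) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Select the two unparented nodes of smallest weight and give them a parent.
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), ("node", n1, n2)))
    root = heap[0][2]
    codes = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node[0] == "leaf":
            codes[node[1]] = prefix or "0"   # a one-symbol alphabet still needs one bit
        else:
            stack.append((node[1], prefix + "0"))   # left edge labeled 0
            stack.append((node[2], prefix + "1"))   # right edge labeled 1
    return codes
```

With frequencies a: 5, b: 2, c: 1, d: 1, the resulting code lengths are 1, 2, 3, and 3 bits respectively; the exact bit patterns depend on how ties are broken.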

What should not be considered constant is the time for statistics gathering, i.e., for
computing the probabilities p_i. This process is parallelizable as was done in the previous two
sections: sort the input string in parallel using Parsort, then use Barrier-Parprefix to compute
the frequencies of the distinct symbols in the input string. Those frequencies are then divided
by N to obtain the probabilities, although this step is unnecessary since Huffman coding gives
the same results if it uses frequencies instead of probabilities. The statistics gathering process
takes O(log^2 N) parallel time, dominated by the sorting step.
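
The sort-then-prefix idea can be illustrated with a sequential Python sketch. It rests on assumptions: Python's sorted stands in for Parsort, and a prefix sum that restarts at each run boundary stands in for Barrier-Parprefix, whose exact definition is given earlier in the paper and is not reproduced here. The function name symbol_frequencies is hypothetical.

```python
def symbol_frequencies(x):
    """Count symbol occurrences by sorting, then a prefix sum that restarts at run boundaries."""
    s = sorted(x)                       # stands in for Parsort
    n = len(s)
    # Barrier flags: position i starts a new run of equal symbols.
    barrier = [i == 0 or s[i] != s[i - 1] for i in range(n)]
    # Segmented prefix sum of ones: the running count restarts at every barrier.
    running = [0] * n
    for i in range(n):
        running[i] = 1 if barrier[i] else running[i - 1] + 1
    # The last position of each run holds that symbol's frequency.
    return {s[i]: running[i] for i in range(n) if i == n - 1 or barrier[i + 1]}
```

The counts can be passed to the tree-building step directly, since dividing by N does not change the resulting codes.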

Once the symbol codes have been determined, each symbol x[i] is replaced by its code,
and all symbols are processed this way in parallel. The concatenation of all the symbol codes is
the output bitstream. This code replacement process takes O(1) parallel time, since the length of
each symbol code is < n and is thus a constant. In summary, the total time of Huffman coding
an input file of N symbols is O(n log n + log^2 N).
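
As a small illustration of the replacement step, the Python sketch below looks up each symbol's code independently and joins the results. The function name huffman_encode is hypothetical. The start offsets it also returns are an illustrative addition, not a step described in the text: they show one way each code's position in the output could be derived, as an exclusive prefix sum of the code lengths.

```python
from itertools import accumulate

def huffman_encode(x, codes):
    """Replace every symbol by its code; return the bitstream and each code's start offset."""
    pieces = [codes[sym] for sym in x]            # independent per-symbol lookups
    lengths = [len(p) for p in pieces]
    # Exclusive prefix sum of the code lengths: where each code starts in the output.
    offsets = [0] + list(accumulate(lengths))[:-1] if pieces else []
    return "".join(pieces), offsets
```

For the code table a: 1, b: 00, c: 010, the string "abca" is encoded as 1000101, with the four codes starting at bit positions 0, 1, 3, and 6.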

Huffman decoding works as follows, assuming that the Huffman tree is available. The
bitstream is scanned from left to right. When a bit is scanned, we traverse the Huffman tree
one step down, left if the bit is 0, right if the bit is 1. Once a leaf is encountered, the scanned
substring that led from the root to the leaf is replaced (decoded) by the symbol of that leaf.
The process is repeated by resetting the traversal to start from the root, while the scanning
continues from where it left off. This decoding process is very hard to parallelize, because the
position where each symbol code begins is known only after all preceding codes have been
decoded. It may be inherently sequential, but no attempt is made here to prove that.
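
The scan-and-traverse procedure can be sketched in Python as follows. The helper names build_tree and huffman_decode and the dict-based tree are assumptions made for the example; the tree is rebuilt here from a code table only to keep the sketch self-contained.

```python
def build_tree(codes):
    """Build a binary tree from {symbol: bitstring}; internal nodes are dicts keyed by '0'/'1'."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = ("leaf", sym)
    return root

def huffman_decode(bits, root):
    """Scan the bitstream left to right, walking down the tree one step per bit."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]                # '0' goes left, '1' goes right
        if isinstance(node, tuple):     # reached a leaf: emit its symbol
            out.append(node[1])
            node = root                 # restart the traversal from the root
    return out
```

The loop makes the sequential dependency visible: the bit at which one symbol code ends determines where the next one starts.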

One approach can bring some parallelism into Huffman decoding. In many applications
and compression standards such as JPEG, MPEG2, and the upcoming MPEG4, the data is
divided into blocks at some stage in the compression process, and the blocks are quantized and
then entropy-coded independently of one another. The bitstreams of those blocks are then
concatenated into a single bitstream according to some static ordering scheme of the blocks.
A special End-of-Block (EOB) symbol is added to the alphabet and entropy-coded like the
other symbols; the EOB symbol tells the decoder when a block ends and the next begins. If
parallel decoding is needed, the bitstreams of the various blocks should not be concatenated into
one single stream. Rather, they should be formed into as many separate streams as there are
processors to be used for decoding. That way, the separate streams can be decoded
independently in parallel. If the streams are made roughly equal in length, the decoding
processes are load balanced, leading to nearly optimal parallel decoding. The details of
that approach, and the actual structure of the file that contains the separate bitstreams, are
left to future work.
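
Since the details are left to future work, the following Python sketch shows only one possible arrangement, under stated assumptions: the code table CODES, the greedy packing rule in pack_streams, and the use of a thread pool are all hypothetical choices made for illustration, not the paper's design.

```python
from concurrent.futures import ThreadPoolExecutor

CODES = {"a": "1", "b": "00", "c": "010", "EOB": "011"}   # hypothetical code table

def decode_stream(bits, codes=CODES):
    """Decode one stream into a list of blocks, splitting at each EOB symbol."""
    inverse = {code: sym for sym, code in codes.items()}
    blocks, current, word = [], [], ""
    for bit in bits:
        word += bit
        if word in inverse:             # prefix-free code: the first match is the codeword
            sym, word = inverse[word], ""
            if sym == "EOB":
                blocks.append(current)
                current = []
            else:
                current.append(sym)
    return blocks

def pack_streams(block_bitstreams, workers):
    """Greedy load balancing: append each block to the currently shortest stream."""
    streams = [[] for _ in range(workers)]
    lengths = [0] * workers
    for idx, bits in enumerate(block_bitstreams):
        k = lengths.index(min(lengths))
        streams[k].append((idx, bits))
        lengths[k] += len(bits)
    return streams

def parallel_decode(block_bitstreams, workers=2):
    """Decode the separate streams concurrently and restore the original block order."""
    streams = pack_streams(block_bitstreams, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        decoded = list(pool.map(
            lambda s: decode_stream("".join(b for _, b in s)), streams))
    result = [None] * len(block_bitstreams)
    for stream, blocks in zip(streams, decoded):
        for (idx, _), block in zip(stream, blocks):
            result[idx] = block
    return result
```

Each stream is decoded without reference to the others, which is the property the approach relies on. In CPython, threads do not speed up a CPU-bound loop like this one; a process pool with a module-level worker function would be one way to obtain real parallelism.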


6 Conclusions


In this paper we developed parallel algorithms for several widely used entropy coding techniques,
namely, arithmetic coding, run-length encoding, and Huffman coding. In all three, the coding
turned out to be parallelizable, taking mainly O(log N) time on N processors, except in the cases
where sorting was used for statistics gathering, which require O(log^2 N) time. Decoding, however,
turned out to be much harder to parallelize, except in the RLE case, where it takes logarithmic
time. In practice, however, both arithmetic and Huffman coding are used in a way that allows
for simple parallel decoding. The details of parallelizing Huffman decoding and arithmetic
decoding, and the performance of those algorithms, are the subject of further research.